[opt](rowset) Aggregate non-MOW segment key bounds to reduce rowset meta size #62604
Conversation
For non-MOW (duplicate / aggregate key) tables, per-segment key bounds are not consumed on the read path — only the rowset-level [min, max] is used by the reader and ordered compaction. In cloud mode, persisting bounds for every segment can blow past FDB's value size limit on commit_rowset.

Introduce an `enable_aggregate_non_mow_key_bounds` BE config (default off). When enabled, non-MOW rowsets collapse per-segment bounds into a single [overall_min, overall_max] entry at write time, and compaction preserves this behavior. MOW rowsets always retain per-segment bounds — their `lookup_row_key` path relies on them for delete bitmap computation, and is guarded by a new DCHECK against aggregated input.

A new optional `segments_key_bounds_aggregated` flag is added to both RowsetMetaPB and RowsetMetaCloudPB so consumers can distinguish aggregated from per-segment layouts. Proto round-trip, pb_convert, snapshot restore, and index builder all preserve both this flag and the existing `segments_key_bounds_truncated` flag.

Correctness notes:
- `first_key`/`last_key` callers (`block_reader`, ordered compaction) already bail out on overlapping rowsets, so for non-overlapping rowsets the aggregated [min, max] equals seg[0].min / seg[last].max exactly.
- `merge_rowset_meta` (MOW partial-update publish) DCHECKs both sides are non-aggregated.
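The write-time collapse described above can be sketched as follows. This is a minimal illustration only: the `KeyBounds` struct and `aggregate_key_bounds` function are simplified stand-ins for the real `KeyBoundsPB` and the logic inside `RowsetMeta::set_segments_key_bounds`, not the actual Doris API.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for Doris's KeyBoundsPB (hypothetical).
struct KeyBounds {
    std::string min_key;
    std::string max_key;
};

// Collapse per-segment bounds into a single [overall_min, overall_max] entry,
// as the non-MOW write path does when the config is enabled.
std::vector<KeyBounds> aggregate_key_bounds(const std::vector<KeyBounds>& per_segment) {
    if (per_segment.size() <= 1) return per_segment;  // nothing to collapse
    std::string overall_min = per_segment.front().min_key;
    std::string overall_max = per_segment.front().max_key;
    for (const auto& kb : per_segment) {
        overall_min = std::min(overall_min, kb.min_key);
        overall_max = std::max(overall_max, kb.max_key);
    }
    return {KeyBounds{overall_min, overall_max}};
}
```

For non-overlapping segments the loop degenerates to taking seg[0].min and seg[last].max, which is why `first_key`/`last_key` consumers observe identical values before and after aggregation.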
Thank you for your contribution to Apache Doris. Please clearly describe your PR:

/review
Pull request overview
This PR introduces an optional optimization to reduce rowset metadata size for non-MOW (duplicate/aggregate-key) tables by aggregating per-segment key bounds into a single rowset-level [min, max] entry, primarily targeting cloud mode FDB value size limits.
Changes:
- Add `enable_aggregate_non_mow_key_bounds` BE config (default off) and use it in rowset write + ordered compaction to optionally aggregate non-MOW `segments_key_bounds`.
- Add `segments_key_bounds_aggregated` to RowsetMeta protobufs and propagate it through pb_convert, snapshot restore, index builder, and rowset meta helpers.
- Add BE unit tests and a regression suite validating aggregated/non-aggregated behavior for non-MOW vs MOW tables.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| regression-test/suites/data_model_p0/duplicate/test_non_mow_key_bounds_aggregation.groovy | Regression coverage for config on/off and MOW vs non-MOW layout expectations. |
| gensrc/proto/olap_file.proto | Adds segments_key_bounds_aggregated flag to RowsetMetaPB/RowsetMetaCloudPB. |
| be/test/storage/rowset/rowset_meta_test.cpp | Unit tests for aggregation behavior and truncation interaction. |
| be/src/storage/task/index_builder.cpp | Preserves aggregated/truncated flags when rebuilding rowset meta for index tasks. |
| be/src/storage/tablet/base_tablet.cpp | Adds DCHECK to ensure MOW lookup path never sees aggregated bounds. |
| be/src/storage/rowset/rowset_meta.h | Adds aggregated flag accessors and extends set_segments_key_bounds API. |
| be/src/storage/rowset/rowset_meta.cpp | Implements aggregation in set_segments_key_bounds; adds DCHECK in merge_rowset_meta. |
| be/src/storage/rowset/rowset.h | Exposes is_segments_key_bounds_aggregated() on Rowset. |
| be/src/storage/rowset/beta_rowset_writer.cpp | Aggregates non-MOW key bounds at write time based on new config. |
| be/src/storage/compaction/compaction.cpp | Aggregates non-MOW key bounds for ordered compaction output based on new config. |
| be/src/common/config.h / be/src/common/config.cpp | Declares/defines enable_aggregate_non_mow_key_bounds. |
| be/src/cloud/pb_convert.cpp | Propagates aggregated flag between Doris and cloud rowset meta PBs. |
| be/src/cloud/cloud_snapshot_mgr.cpp | Preserves aggregated/truncated flags during snapshot rowset meta creation. |
run buildall
Force-pushed from 1b267ba to 89e4675.
run buildall
OpenCode automated review failed and did not complete. Error: the review step failed (possibly a timeout or cancellation). Please inspect the workflow logs and rerun the review after the underlying issue is resolved.
FE UT Coverage Report: Increment line coverage 🎉
run buildall
BE UT Coverage Report: Increment line coverage, Increment coverage report
BE Regression && UT Coverage Report: Increment line coverage, Increment coverage report
FE Regression Coverage Report: Increment line coverage
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested. |
/review
I found one blocking correctness issue.

1. The new aggregated key-bounds flag is not actually preserved in the index rewrite path. The flag copied onto the rebuilt rowset meta is immediately overwritten by the following call, which recomputes the flag as unset even when the bounds already contain the single aggregated entry. Rebuilding an aggregated non-MOW rowset via ADD/DROP INDEX therefore emits metadata whose aggregated flag is cleared while only a single bounds entry remains.

Checkpoint summary:
- Goal: The PR is meant to aggregate non-MOW key bounds and preserve the new layout/flag across write, compaction, cloud-conversion, restore, and index-rewrite paths. The current head still misses the index-rewrite path, so the goal is not fully met.
- Tests: Unit coverage and regression coverage for write-time aggregation look good. I did not find coverage for ADD/DROP INDEX on an already aggregated non-MOW rowset, which is why this regression slips through.
- Concurrency/lifecycle: I did not find a new concurrency or lifecycle issue in the touched paths.
- Config: The new mutable BE config is observed in the write and ordered-compaction paths. Note that the code currently defaults it to on, which is broader than the PR text's "default off".
- Compatibility/persistence: The proto fields and cloud/restore conversion paths are preserved correctly in the current head.
- Performance: The intended metadata-size reduction looks fine where the aggregated flag is preserved.
Correction to my earlier review text: the blocking issue is in be/src/storage/task/index_builder.cpp. rowset_meta->set_segments_key_bounds_aggregated(input_rowset_meta->is_segments_key_bounds_aggregated()) is immediately overwritten by the next rowset_meta->set_segments_key_bounds(key_bounds) call. When the input rowset is already aggregated, key_bounds contains the single aggregated entry, but set_segments_key_bounds(key_bounds) runs with aggregate_into_single=false, so the rebuilt rowset loses the aggregated flag while still keeping num_segments > 1 and segments_key_bounds.size() == 1. Ordered compaction relies on the aggregated flag to preserve that layout when the config is off, so ADD/DROP INDEX can emit inconsistent rowset metadata here. Please pass the source flag into set_segments_key_bounds(key_bounds, input_rowset_meta->is_segments_key_bounds_aggregated()) or remove the standalone setter, and add a test that rebuilds an already-aggregated non-MOW rowset through the index path.
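The ordering hazard described in that correction can be sketched with a simplified stand-in class. `FakeRowsetMeta` and its members are hypothetical shapes for illustration, not the real RowsetMeta API; only the clobbering behavior mirrors the reviewed code.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct KeyBounds {  // stand-in for KeyBoundsPB (hypothetical)
    std::string min_key;
    std::string max_key;
};

// Simplified stand-in for RowsetMeta (hypothetical shape, not the real class).
struct FakeRowsetMeta {
    std::vector<KeyBounds> bounds;
    bool aggregated = false;

    void set_segments_key_bounds_aggregated(bool v) { aggregated = v; }

    // Mirrors the reviewed behavior: the setter recomputes the flag from its
    // own argument, so any value copied in beforehand is silently discarded.
    void set_segments_key_bounds(std::vector<KeyBounds> kb,
                                 bool aggregate_into_single = false) {
        if (aggregate_into_single && kb.size() > 1) {
            KeyBounds merged{kb.front().min_key, kb.front().max_key};
            for (const auto& b : kb) {
                merged.min_key = std::min(merged.min_key, b.min_key);
                merged.max_key = std::max(merged.max_key, b.max_key);
            }
            kb = {merged};
        }
        bounds = std::move(kb);
        // Recomputed unconditionally: this overwrites a flag set earlier via
        // set_segments_key_bounds_aggregated(), which is the reported bug.
        aggregated = aggregate_into_single;
    }
};
```

The suggested fix is then simply to forward the source rowset's flag as the second argument instead of calling the standalone setter first, so the final assignment computes the correct value.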
run buildall |
Cloud UT Coverage Report: Increment line coverage, Increment coverage report
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
For non-MOW (duplicate / aggregate key) tables, per-segment key bounds are not consumed on the read path — only the rowset-level [min, max] is used by the reader and ordered compaction. In cloud mode, persisting bounds for every segment can blow past FDB's value size limit on commit_rowset.
Introduce an `enable_aggregate_non_mow_key_bounds` BE config (default off). When enabled, non-MOW rowsets collapse per-segment bounds into a single [overall_min, overall_max] entry at write time, and compaction preserves this behavior. MOW rowsets always retain per-segment bounds — their `lookup_row_key` path relies on them for delete bitmap computation, and is guarded by a new DCHECK against aggregated input.

A new optional `segments_key_bounds_aggregated` flag is added to both RowsetMetaPB and RowsetMetaCloudPB so consumers can distinguish aggregated from per-segment layouts. Proto round-trip, pb_convert, snapshot restore, and index builder all preserve both this flag and the existing `segments_key_bounds_truncated` flag.

Correctness notes:
- `first_key`/`last_key` callers (`block_reader`, ordered compaction) already bail out on overlapping rowsets, so for non-overlapping rowsets the aggregated [min, max] equals seg[0].min / seg[last].max exactly.
- `merge_rowset_meta` (MOW partial-update publish) DCHECKs both sides are non-aggregated.

Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)